arXiv:1503.07211v1 [cs.LG] 24 Mar 2015


Universal Approximation of Markov Kernels by 
Shallow Stochastic Feedforward Networks 


Guido Montufar montufar@mis.mpg.de 

Max Planck Institute for Mathematics in the Sciences, 04103 Leipzig, Germany 


Abstract 

We establish upper bounds for the minimal number of hidden units for which a binary stochastic
feedforward network with sigmoid activation probabilities and a single hidden layer is a universal
approximator of Markov kernels. We show that each possible probabilistic assignment of the states
of $n$ output units, given the states of $k \geq 1$ input units, can be approximated arbitrarily well by a
network with $2^{k-1}(2^{n-1} - 1)$ hidden units.

1. Introduction

The universal approximation capabilities of feedforward networks with one hidden layer of
computational units have been studied in numerous papers and have been established under quite general
conditions on the activation functions and the input-output domains (Cybenko, 1989; Hornik et al.,
1989; Leshno et al., 1993; Chen and Chen, 1995; Gallant and White, 1988). Some works have also
studied the minimal size of universal approximators and the quality of the approximations when the
networks have only a limited number of hidden units (Hornik, 1991; Barron, 1993; Wenzel et al.,
2000).

In the context of feedforward networks, the universal approximation question most commonly
refers to the approximation of deterministic functions. In this paper we address a related problem
that has received a bit less attention. We study the universal approximation of stochastic functions
(Markov kernels) and the minimal number of hidden units in a stochastic feedforward network that
is sufficient for this purpose. For a network with $k$ input binary units and $n$ output binary units,
we are interested in maps taking inputs from $\{0,1\}^k$ to probability distributions over outputs from
$\{0,1\}^n$. The outputs of the network are length-$n$ binary vectors, but the outputs of the stochastic
maps are length-$2^n$ probability vectors. We focus on shallow networks, with one single hidden
layer, as the one illustrated in Figure 1, and stochastic binary units with output 1 probability given
by the sigmoid of a weighted sum of their inputs. Given the number of input and output units, $k$ and
$n$, what is the smallest number of hidden units $m$ that suffices to obtain a universal approximator of
stochastic maps? We show that this is not more than $2^{k-1}(2^{n-1} - 1)$. We also consider the case
where the weights of the output layer are fixed in advance and only the weights of the hidden layer
are tunable. In that setting we show that $2^{k-1}(2^n - 1)$ hidden units are sufficient.

Some previous works have discussed compact representations of stochastic maps by feedforward
networks, but focusing on the approximation of probability distributions (instead of Markov
kernels) and deep networks with many hidden layers (Sutskever and Hinton, 2008; Le Roux and
Bengio, 2010; Montufar and Ay, 2011).

This paper is organized as follows. Section 2 contains basic definitions and notations, as well
as a few comments about the deterministic setting. Section 3 presents our main results. Section 4
contains our analysis of the minimal number of hidden units for which a network can approximate any
stochastic function arbitrarily well by tuning the input weights and biases of the first layer, while the
weights and biases of the second layer are kept fixed. Section 5 contains a corresponding analysis
for the case when the input weights and biases of both layers are tuned. Section 6 offers our conclusions
and outlook.


2. Settings 

This section contains definitions and basic observations. 


2.1. Probability Distributions and Markov Kernels 

Throughout this paper $k$, $m$, and $n$ denote finite natural numbers. We denote by $\{0,1\}^n$ the set of
length-$n$ binary strings. This is the set of all possible configurations of $n$ binary units. The set of
probability distributions over $\{0,1\}^n$ is given by

$$\Delta_n := \Big\{ p \in \mathbb{R}^{2^n} : p(x) \geq 0 \;\; \forall x \in \{0,1\}^n, \; \sum_{x \in \{0,1\}^n} p(x) = 1 \Big\}. \tag{1}$$

This is the $(2^n - 1)$-dimensional simplex of all vectors in $2^n$-dimensional Euclidean space with
non-negative entries and 1-norm equal to 1. The set of strictly positive distributions is the relative
interior of $\Delta_n$, denoted by $\Delta_n^+$.

A Markov kernel with source $\{0,1\}^k$ and target $\{0,1\}^n$ is a map from $\{0,1\}^k$ to $\Delta_n$. The set
of all such kernels is

$$\Delta_{k,n} := \Big\{ P \in \mathbb{R}^{2^k \times 2^n} : P(y; x) \geq 0, \; \sum_{x \in \{0,1\}^n} P(y; x) = 1 \;\; \forall y \in \{0,1\}^k \Big\}.$$

Each Markov kernel is written as a matrix $P$ with entries $P(x|y) = P(y; x)$ for all $x \in \{0,1\}^n$ and
all $y \in \{0,1\}^k$. The $y$-th row $P(\cdot|y)$ is a probability distribution of the output $x$, given the input $y$.
The set $\Delta_{k,n}$ is the $2^k(2^n - 1)$-dimensional polytope of $2^k \times 2^n$ row-stochastic matrices, which is
the $2^k$-th Cartesian power of $\Delta_n$.


2.2. Feedforward Stochastic Networks 

We consider stochastic binary units of the following form. Consider the sigmoid function

$$\sigma : \mathbb{R} \to [0,1]; \quad a \mapsto \frac{1}{1 + \exp(-a)}.$$

For each input vector $y \in \{0,1\}^k$, the unit computes a scalar pre-activation value by an affine
function $v^\top y + c$, and outputs 1 with probability $\Pr(1) = \sigma(v^\top y + c)$, or otherwise it outputs 0 with
complementary probability, $\Pr(0) = 1 - \sigma(v^\top y + c)$.

Figure 1: Feedforward network with a layer of $k$ input units, a layer of $m$ hidden units, and a layer
of $n$ output units.

Given an input vector $y \in \{0,1\}^k$, the probability that a feedforward layer of $m$ stochastic units
outputs $z = (z_1, \dots, z_m)^\top \in \{0,1\}^m$ is given by the product of the output probabilities of the
individual units in the layer,

$$\Pr(z) = \prod_{j \in [m]} \sigma(v_j^\top y + c_j)^{z_j} \big(1 - \sigma(v_j^\top y + c_j)\big)^{1 - z_j}
= \frac{\exp((Vy + c)^\top z)}{\sum_{z' \in \{0,1\}^m} \exp((Vy + c)^\top z')},$$

where $V = (v_1 | \cdots | v_m)^\top \in \mathbb{R}^{m \times k}$ and $c = (c_1, \dots, c_m)^\top \in \mathbb{R}^m$.
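The equality of the product form and the normalized exponential form can be checked numerically. The following is a minimal sketch in plain Python; the weights `V`, `c` and the input `y` are made-up illustrative values, and the helper names are not from the paper.

```python
import itertools
import math


def sigmoid(a):
    """Logistic sigmoid, written to avoid overflow for large |a|."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)


def pre_activations(V, c, y):
    """Affine map y -> Vy + c, with V given as a list of m rows of length k."""
    return [sum(v * yi for v, yi in zip(row, y)) + cj for row, cj in zip(V, c)]


def prob_product_form(V, c, y, z):
    """prod_j sigma(a_j)^{z_j} (1 - sigma(a_j))^{1 - z_j}."""
    prob = 1.0
    for a_j, z_j in zip(pre_activations(V, c, y), z):
        prob *= sigmoid(a_j) if z_j == 1 else 1.0 - sigmoid(a_j)
    return prob


def prob_exponential_form(V, c, y, z):
    """exp(a.z) / sum_{z'} exp(a.z'), with a = Vy + c."""
    a = pre_activations(V, c, y)

    def weight(state):
        return math.exp(sum(ai * si for ai, si in zip(a, state)))

    normalizer = sum(weight(zp) for zp in itertools.product((0, 1), repeat=len(a)))
    return weight(z) / normalizer


# Illustrative (made-up) parameters for k = 2 inputs and m = 3 units.
V = [[0.5, -1.0], [2.0, 0.3], [-0.7, 0.1]]
c = [0.1, -0.4, 0.0]
y = (1, 0)
states = list(itertools.product((0, 1), repeat=3))
max_gap = max(
    abs(prob_product_form(V, c, y, z) - prob_exponential_form(V, c, y, z))
    for z in states
)
total = sum(prob_product_form(V, c, y, z) for z in states)
print(max_gap, total)
```

Both forms agree up to floating-point error, and the probabilities over all $2^m$ states sum to one.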


Definition 1 We denote by $F_{k,m} \subseteq \Delta_{k,m}$ the set of Markov kernels that can be represented by a
feedforward layer with $k$ input units and $m$ output units; that is, the set of kernels

$$P(z|y) = \frac{\exp((Vy + c)^\top z)}{\sum_{z' \in \{0,1\}^m} \exp((Vy + c)^\top z')} \quad \forall z \in \{0,1\}^m, \; y \in \{0,1\}^k,$$

parametrized by $V \in \mathbb{R}^{m \times k}$ and $c \in \mathbb{R}^m$.


Definition 2 We denote by $F_{k,m,n} = F_{m,n} \circ F_{k,m} \subseteq \Delta_{k,n}$ the set of Markov kernels that can be
represented by a feedforward network with $k$ input units, $m$ hidden units, and $n$ output units; that
is, the set of kernels of the form

$$P(x|y) = \sum_{z \in \{0,1\}^m} Q(z|y) R(x|z) \quad \forall y \in \{0,1\}^k, \; x \in \{0,1\}^n,$$

where $Q \in F_{k,m}$ and $R \in F_{m,n}$.

Fig. 1 gives a schematic illustration of a feedforward network with one input, one hidden, and 
one output layer. 
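As a small illustration of Definition 2, the sketch below builds two layer kernels as dictionaries and composes them by summing over the hidden states. The parameter values and the function names `layer_kernel` and `compose` are assumptions made for this example only.

```python
import itertools
import math


def sigmoid(a):
    """Logistic sigmoid, written to avoid overflow for large |a|."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)


def layer_kernel(V, c):
    """Kernel of one stochastic layer as a dict {(input, output): probability}."""
    n_in, n_out = len(V[0]), len(V)
    kernel = {}
    for y in itertools.product((0, 1), repeat=n_in):
        on = [sigmoid(sum(v * yi for v, yi in zip(row, y)) + cj)
              for row, cj in zip(V, c)]
        for z in itertools.product((0, 1), repeat=n_out):
            prob = 1.0
            for s, zj in zip(on, z):
                prob *= s if zj == 1 else 1.0 - s
            kernel[(y, z)] = prob
    return kernel


def compose(Q, R):
    """P(x|y) = sum_z Q(z|y) R(x|z)."""
    P = {}
    for (y, z), q in Q.items():
        for (z_in, x), r in R.items():
            if z_in == z:
                P[(y, x)] = P.get((y, x), 0.0) + q * r
    return P


# Made-up parameters: k = 1 input unit, m = 2 hidden units, n = 2 output units.
Q = layer_kernel([[1.0], [-2.0]], [0.0, 0.5])
R = layer_kernel([[0.3, -1.2], [2.0, 0.7]], [0.1, -0.5])
P = compose(Q, R)
row_sums = {
    y: sum(P[(y, x)] for x in itertools.product((0, 1), repeat=2))
    for y in itertools.product((0, 1), repeat=1)
}
print(row_sums)
```

The composed object has $2^k \cdot 2^n$ entries and each row sums to one, as expected for an element of $\Delta_{k,n}$.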

The following set of probability distributions will be important in our analysis. 


Definition 3 We denote by $\mathcal{E}_m \subseteq \Delta_m$ the set of probability distributions on $\{0,1\}^m$ of the form

$$p(z) = \frac{\exp(b^\top z)}{\sum_{z' \in \{0,1\}^m} \exp(b^\top z')} \quad \forall z \in \{0,1\}^m,$$

parametrized by $b \in \mathbb{R}^m$. This is precisely the set of probability distributions that factorize as

$$p(z) = \prod_{j \in [m]} p_j^{1 - z_j} (1 - p_j)^{z_j} \quad \forall z = (z_1, \dots, z_m)^\top \in \{0,1\}^m,$$

for some $p_j \in (0,1)$ for all $j \in [m]$.

Note that each kernel $Q \in F_{k,m}$ represented by a feedforward layer is a tuple of $2^k$ factorizing
probability distributions, $Q(\cdot|y) \in \mathcal{E}_m$ for all $y \in \{0,1\}^k$.

2.3. Universal Approximation 

Definition 4 A set $\mathcal{M} \subseteq \Delta_{k,n}$ is a universal approximator if and only if every point from $\Delta_{k,n}$ can
be approximated arbitrarily well by points from $\mathcal{M}$. This is the case if and only if $\overline{\mathcal{M}} = \Delta_{k,n}$,
where $\overline{\mathcal{M}}$ denotes the closure of $\mathcal{M}$ in the Euclidean topology.

We will study the universal approximation properties of $F_{k,m,n}$ in two cases. In the first case,
only the first layer has free parameters, whereas the second layer has fixed parameters. This means
that we consider the set $R \circ F_{k,m} \subseteq \Delta_{k,n}$ for some fixed $R \in F_{m,n}$. In the second case, all
parameters are free.

By comparing the number of free parameters and the dimension of $\Delta_{k,n}$, it is straightforward
to obtain the following crude lower bound on the minimal number of hidden units that suffices for
universal approximation:

Proposition 5 Let $k \geq 1$ and $n \geq 1$.

• If there is an $R \in F_{m,n}$ with $\overline{R \circ F_{k,m}} = \Delta_{k,n}$, then $m \geq \lceil \frac{1}{k+1} 2^k (2^n - 1) \rceil$.

• If $\overline{F_{k,m,n}} = \Delta_{k,n}$, then $m \geq \lceil \frac{1}{n+k+1} (2^k (2^n - 1) - n) \rceil$.

2.4. Feedforward Deterministic Networks 

Deterministic networks are special cases of stochastic networks. Consider a feedforward stochastic
network as defined above, but where all input weights and biases, $W$ and $b$, are multiplied by
$r \in \mathbb{R}$. For generic choices of $W$ and $b$, when $r \to \infty$ each unit outputs 0 or 1 with probability
one, depending on its inputs. In this case, the kernels represented by the feedforward network are
deterministic, meaning that they have the form $P(x|y) = \delta_{f(y)}(x)$ for all $x \in \{0,1\}^n$, for all
$y \in \{0,1\}^k$, for some function $f : \{0,1\}^k \to \{0,1\}^n$. The feedforward networks defined by these
limits are called linear threshold networks.
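The limit can be illustrated with a single unit. In the sketch below, the weights $v = (1, 1)$ and bias $c = -1.5$ are an assumed example (a threshold unit for the logical AND of two inputs); scaling them by a growing factor `r` drives the output probabilities towards 0 or 1.

```python
import itertools
import math


def sigmoid(a):
    """Logistic sigmoid, written to avoid overflow for large |a|."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)


def unit_prob(v, c, y, r):
    """Probability of output 1 when weights and bias are scaled by r."""
    return sigmoid(r * (sum(vi * yi for vi, yi in zip(v, y)) + c))


# Assumed example: the pre-activation is nonzero on every binary input,
# so the choice is generic in the sense used in the text.
v, c = (1.0, 1.0), -1.5
inputs = list(itertools.product((0, 1), repeat=2))
for r in (1.0, 10.0, 50.0):
    print(r, [round(unit_prob(v, c, y, r), 6) for y in inputs])

# Distance to the deterministic AND function at r = 50.
limit_gap = max(
    abs(unit_prob(v, c, y, 50.0) - (1.0 if y == (1, 1) else 0.0))
    for y in inputs
)
```

For small `r` the unit is visibly stochastic; for large `r` it behaves like a linear threshold unit.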

The representation of Boolean functions by linear threshold networks (with one binary output
unit) has been studied extensively in the literature. The problem can be beautifully described as
the problem of classifying subsets of vertices of the $k$-dimensional unit cube by an arrangement of
oriented affine hyperplanes. In (Wenzel et al., 2000) it was shown that, for $k \geq 2$ and $n = 1$, the
smallest number $m$ of hidden units for which a linear threshold network with one hidden layer can

compute every Boolean function satisfies 2 fc / 2 — ^ < — ^ + + 2 k < m < 2 fc - 

It is important to realize that, in the deterministic setting, if a network with $m$ hidden units can
represent any Boolean function $f : \{0,1\}^k \to \{0,1\}$, then a network with $n \cdot m$ hidden units can
represent any function $g : \{0,1\}^k \to \{0,1\}^n$. This is because one can always write $g = (f_1, \dots, f_n)$
and compute the individual $f_i$'s in parallel groups of hidden and output units. In the stochastic
setting the same is not true. In general, a joint distribution over $\{0,1\}^n$ cannot be written in terms
of $n$ marginal distributions over $\{0,1\}$ alone, and so, the activities of the individual output units
cannot be computed independently from each other.

3. Results 

We bound the minimal number of hidden units of a universal approximator from above in two 
cases. In the first case we consider a network whose second layer has fixed weights and biases. In 
the second case all weights and biases are free parameters. 

3.1. Fixed Weights in the Second Layer 

Theorem 6 Let $k \geq 1$ and $n \geq 1$. There is an $R \in F_{m,n}$ such that $\overline{R \circ F_{k,m}} = \Delta_{k,n}$ whenever
$m \geq \frac{1}{2} 2^k (2^n - 1)$.

In view of the crude lower bound from Proposition 5, this upper bound is tight at least when $k = 1$.

When there are no input units, $k = 0$, we may set $F_{0,m} = \mathcal{E}_m$ and $\Delta_{0,n} = \Delta_n$. The theorem
generalizes to this case as:

Proposition 7 Let $n \geq 2$. There is an $R \in F_{m,n}$ with $\overline{R \circ \mathcal{E}_m} = \Delta_n$, whenever $m \geq 2^n - 1$.

This bound is always tight, since the network uses exactly $2^n - 1$ parameters to approximate every
distribution from $\Delta_n$ arbitrarily well.

3.2. Changeable Weights in the Second Layer 

When both layers have changeable weights we obtain: 

Theorem 8 Let $k \geq 1$ and $n \geq 2$. Then $\overline{F_{k,m,n}} = \Delta_{k,n}$, whenever $m \geq 2^{k-1}(2^{n-1} - 1)$.

Comparison with the crude lower bound from Proposition 5 reveals that this bound is tight at least 
for (k, n) = (1,2), (2, 2), (3, 2), (1,3). 
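The tightness remarks can be reproduced by evaluating the parameter-counting lower bounds of Proposition 5 next to the upper bounds of Theorems 6 and 8. The following sketch does this with integer arithmetic; the function names and the scanned ranges of $k$ and $n$ are choices made for this illustration.

```python
def ceil_div(a, b):
    """Ceiling of a / b for positive integers."""
    return -(-a // b)


def upper_fixed_second_layer(k, n):
    """Theorem 6: 2^(k-1) (2^n - 1) hidden units suffice."""
    return 2 ** (k - 1) * (2 ** n - 1)


def upper_all_free(k, n):
    """Theorem 8: 2^(k-1) (2^(n-1) - 1) hidden units suffice."""
    return 2 ** (k - 1) * (2 ** (n - 1) - 1)


def lower_fixed_second_layer(k, n):
    """Proposition 5, first item: ceil(2^k (2^n - 1) / (k + 1))."""
    return ceil_div(2 ** k * (2 ** n - 1), k + 1)


def lower_all_free(k, n):
    """Proposition 5, second item: ceil((2^k (2^n - 1) - n) / (n + k + 1))."""
    return ceil_div(2 ** k * (2 ** n - 1) - n, n + k + 1)


# Pairs (k, n) in a small range where the bounds of Theorem 8 and
# Proposition 5 coincide.
tight_all_free = sorted(
    (k, n)
    for k in range(1, 5)
    for n in range(2, 5)
    if upper_all_free(k, n) == lower_all_free(k, n)
)
# For k = 1 the bounds for a fixed second layer coincide for every n tested.
tight_fixed_k1 = all(
    upper_fixed_second_layer(1, n) == lower_fixed_second_layer(1, n)
    for n in range(1, 8)
)
print(tight_all_free, tight_fixed_k1)
```

Within the scanned range, the coinciding pairs are exactly the four listed above.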

In the case of no inputs we obtain: 

Proposition 9 Let $n \geq 2$. Then $\overline{F_{m,n} \circ \mathcal{E}_m} = \Delta_n$, whenever $m \geq 2^{n-1} - 1$.

3.3. Outline of the Proof 

Our strategy for proving Theorem 6 and Theorem 8 can be summarized as follows: 

• First we show that the first layer of $F_{k,m,n}$ can approximate Markov kernels arbitrarily well,
which fix the state of some units, depending on the input, and have an arbitrary product
distribution over the states of the other units. The idea is illustrated in Fig. 2.

• Then we show that the second layer can approximate deterministic kernels arbitrarily well, 
whose rows are copies of all point measures from A n , ordered in a good way with respect to 
the different inputs. Note that the point measures are the vertices of the simplex A n . 



Figure 2: Illustration of the construction used in our proof. Each block of hidden units is active on 
a distinct subset of possible inputs. The output layer integrates the activities of the block 
that was activated by the input, and produces corresponding activities of the output units. 


• Finally, we show that the set of product distributions of each block of hidden units is mapped 
to the convex hull of the rows of the kernel represented by the second layer, which is A n . 

• The output distributions of distinct sets of inputs are modeled individually by distinct blocks
of hidden units and so we obtain the universal approximation of Markov kernels.

The goal of our analysis is to construct the individual pieces of the network as compactly as
possible. In particular, Lemma 10 will provide a trick that allows us to use each block of units in
the hidden layer for a pair of distinct input vectors at the same time. This allows us to halve the 
number of hidden units that would be needed if each input had an individual block of active hidden 
units. Similarly, Lemma 15 will provide a trick for producing more flexible mixture components at 
the output layer than simply point measures. This again allows us to halve the number of hidden 
units of the simpler construction. 

4. The Number of Hidden Units for Fixed Weights in the Second Layer 
4.1. The First Layer 

We start with the following lemma. 

Lemma 10 Let $y', y'' \in \{0,1\}^k$ differ only in one entry, and let $q', q''$ be any two distributions on
$\{0,1\}$. Then $F_{k,1}$ can approximate the following kernel arbitrarily well:

$$p(\cdot|y) = \begin{cases} q', & \text{if } y = y' \\ q'', & \text{if } y = y'' \\ \delta_0, & \text{else} \end{cases}.$$

Proof Given the input weights and bias, $V \in \mathbb{R}^{1 \times k}$ and $c \in \mathbb{R}$, for each input $y \in \{0,1\}^k$ the
output probability is given by

$$p(z = 1|y) = \sigma(Vy + c). \tag{2}$$

Since the two vectors $y', y'' \in \{0,1\}^k$ differ only in one entry, they are the vertices of an edge $E$
of the $k$-dimensional unit cube. Let $l \in [k] := \{1, \dots, k\}$ be the entry in which they differ, with
$y'_l = 0$ and $y''_l = 1$. Since $E$ is a (1-dimensional) face of the cube, there is a supporting hyperplane
of $E$. This means that there are $\tilde{V} \in \mathbb{R}^{1 \times k}$ and $\tilde{c} \in \mathbb{R}$ with $\tilde{V}y + \tilde{c} = 0$ if $y \in E$, and $\tilde{V}y + \tilde{c} \leq -1$
if $y \in \{0,1\}^k \setminus E$. Let $s' = \sigma^{-1}(q'(z = 1))$ and $s'' = \sigma^{-1}(q''(z = 1))$. We define $c = \alpha\tilde{c} + s'$ and
$V = \alpha\tilde{V} + (s'' - s') e_l^\top$. Then, as $\alpha \to \infty$,

$$Vy + c = \begin{cases} s', & \text{if } y = y' \\ s'', & \text{if } y = y'' \\ -\infty, & \text{else} \end{cases}.$$

Plugging this into (2) proves the claim. ■
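The construction in the proof can be tried out numerically. The sketch below uses one concrete supporting hyperplane, chosen for this illustration (the proof only asserts that one exists): its value at $y$ is minus the Hamming distance between $y$ and $y'$ on the coordinates other than $l$. It also assumes target probabilities strictly between 0 and 1 so that $\sigma^{-1}$ is finite, and it uses 0-based indices.

```python
import itertools
import math


def sigmoid(a):
    """Logistic sigmoid, written to avoid overflow for large |a|."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)


def logit(q):
    """Inverse of the sigmoid on (0, 1)."""
    return math.log(q / (1.0 - q))


def lemma10_parameters(y1, y2, q1, q2, alpha):
    """Weights V and bias c with sigma(V y1 + c) = q1, sigma(V y2 + c) = q2,
    and sigma(V y + c) close to 0 for every other input when alpha is large.
    Requires that y1, y2 differ in exactly one entry l with y1[l] = 0."""
    k = len(y1)
    differing = [i for i in range(k) if y1[i] != y2[i]]
    assert len(differing) == 1 and y1[differing[0]] == 0
    l = differing[0]
    # Supporting hyperplane of the edge: zero on {y1, y2}, at most -1 elsewhere.
    V_tilde = [0.0 if i == l else (1.0 if y1[i] == 1 else -1.0) for i in range(k)]
    c_tilde = -float(sum(y1[i] for i in range(k) if i != l))
    s1, s2 = logit(q1), logit(q2)
    V = [alpha * v for v in V_tilde]
    V[l] += s2 - s1
    c = alpha * c_tilde + s1
    return V, c


def output_prob(V, c, y):
    return sigmoid(sum(v * yi for v, yi in zip(V, y)) + c)


# Illustrative example with k = 3; the two inputs differ in the middle entry.
y1, y2 = (1, 0, 0), (1, 1, 0)
V, c = lemma10_parameters(y1, y2, 0.3, 0.8, alpha=50.0)
probs = {y: output_prob(V, c, y) for y in itertools.product((0, 1), repeat=3)}
for y, p in sorted(probs.items()):
    print(y, p)
```

With a moderately large `alpha`, the unit reproduces the two target probabilities on the edge and is almost surely inactive on all other inputs.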

Given any binary vector $y \in \{0,1\}^k$, let $\operatorname{int}(y) := \sum_{i=1}^{k} 2^{i-1} y_i$ be its integer representation.
Using the previous lemma, we obtain the following.

Proposition 11 Let $N \geq 1$ and $m = 2^{k-1} N$. For each $y \in \{0,1\}^k$, let $p(\cdot|y)$ be an arbitrary
distribution from $\mathcal{E}_N$. The model $F_{k,m}$ can approximate the following kernel from $\Delta_{k,m}$ arbitrarily
well:

$$P(h|y) = \delta_0(h^0) \cdots \delta_0(h^{\lfloor \operatorname{int}(y)/2 \rfloor - 1}) \, p(h^{\lfloor \operatorname{int}(y)/2 \rfloor} | y)
\times \delta_0(h^{\lfloor \operatorname{int}(y)/2 \rfloor + 1}) \cdots \delta_0(h^{2^{k-1} - 1}),$$

where $h^i = (h_{Ni+1}, \dots, h_{N(i+1)})$ for all $i \in \{0, 1, \dots, 2^{k-1} - 1\}$.

Proof We divide the set $\{0,1\}^k$ of all possible inputs into $2^{k-1}$ disjoint pairs with successive
decimal values. The $i$-th pair consists of the two vectors $y$ with $\lfloor \operatorname{int}(y)/2 \rfloor = i$, for all
$i \in \{0, \dots, 2^{k-1} - 1\}$. The kernel $P$ has the property that, for the $i$-th input pair, all output units
are inactive with probability one, except those with index $Ni + 1, \dots, N(i + 1)$. Given a joint
distribution $q$ on the states of $N$ units, let $q_j$ denote the corresponding marginal distribution on the
states of the $j$-th of these units. By Lemma 10, we can set

$$P_{Ni+j}(\cdot|y) = \begin{cases} p_j(\cdot|y), & \text{if } \operatorname{int}(y) = 2i \\ p_j(\cdot|y), & \text{if } \operatorname{int}(y) = 2i + 1 \\ \delta_0, & \text{else} \end{cases}$$

for all $i \in \{0, \dots, 2^{k-1} - 1\}$ and $j \in [N]$. ■
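The bookkeeping behind this construction, pairing the inputs and assigning one block of $N$ hidden units to each pair, can be sketched as follows. The helper names and the values $k = 3$, $N = 2$ are assumptions for illustration; tuples are read with the first entry as the least significant bit, matching $\operatorname{int}(y) = \sum_i 2^{i-1} y_i$.

```python
import itertools


def int_repr(y):
    """int(y) = sum_i 2^(i-1) y_i, with y[0] playing the role of y_1."""
    return sum(2 ** i * yi for i, yi in enumerate(y))


def active_block(y):
    """Index floor(int(y) / 2) of the block of hidden units used for input y."""
    return int_repr(y) // 2


def block_units(i, N):
    """1-based indices N*i + 1, ..., N*(i + 1) of the hidden units in block i."""
    return list(range(N * i + 1, N * (i + 1) + 1))


k, N = 3, 2
m = 2 ** (k - 1) * N  # total number of hidden units
pairs = {}
for y in itertools.product((0, 1), repeat=k):
    pairs.setdefault(active_block(y), []).append(y)

for i in sorted(pairs):
    print("block", i, "inputs", pairs[i], "hidden units", block_units(i, N))
```

Each block serves exactly two inputs, and these two inputs differ only in their first entry, which is the situation covered by Lemma 10.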


4 . 2 . The Second Layer 

For the second layer we will consider deterministic kernels. Given a binary vector $z$, let $l(z) :=
\lceil \log_2(\operatorname{int}(z) + 1) \rceil$ denote the largest $j$ with $z_j = 1$. Here we set $l(0, \dots, 0) = 0$. Given an integer
$l \in \{0, \dots, 2^n - 1\}$, let $\operatorname{bin}_n(l)$ denote the $n$-bit binary representation of $l$; that is, the vector with
$\operatorname{int}(\operatorname{bin}_n(l)) = l$.

Figure 3: Illustration of Lemma 13 for $n = 2$, $N = 2^n - 1 = 3$. There is an arrangement
of $n$ hyperplanes which divides the vertices of a $(2^n - 1)$-dimensional cube as
$0 \,|\, 1 \,|\, 2, 3 \,|\, 4, 5, 6, 7 \,|\, \cdots \,|\, 2^{2^n - 2}, \dots, 2^{2^n - 1} - 1$.


Lemma 12 Let $N = 2^n - 1$. The set $F_{N,n}$ can approximate the following deterministic kernel
arbitrarily well:

$$Q(\cdot|z) = \delta_{\operatorname{bin}_n l(z)}(\cdot) \quad \forall z \in \{0,1\}^N.$$

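In words, the $z$-th row of $Q$ indicates the largest non-zero entry of the binary vector $z$. For example,
for $n = 2$ we have $N = 3$ and

$$Q = \begin{array}{c|cccc}
 & 00 & 01 & 10 & 11 \\ \hline
000 & 1 & & & \\
001 & & 1 & & \\
010 & & & 1 & \\
011 & & & 1 & \\
100 & & & & 1 \\
101 & & & & 1 \\
110 & & & & 1 \\
111 & & & & 1
\end{array}$$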


Proof of Lemma 12 Given the input weights and biases, $W \in \mathbb{R}^{n \times N}$ and $b \in \mathbb{R}^n$, for each input
$z \in \{0,1\}^N$ the output distribution is the product distribution $p(\cdot|z) \in \mathcal{E}_n$ with exponential
parameters $Wz + b$. If $\operatorname{sgn}(Wz + b) = \operatorname{sgn}(x - \frac{1}{2})$ for some $x \in \{0,1\}^n$, then the product distribution
with parameters $\alpha(Wz + b)$, $\alpha \to \infty$, tends to $\delta_x$. We only need to show that there is a choice of $W$
and $b$ with $\operatorname{sgn}(Wz + b) = \operatorname{sgn}(f(z) - \frac{1}{2})$, $f(z) = \operatorname{bin}_n \lceil \log_2(\operatorname{int}(z) + 1) \rceil$, for all $z \in \{0,1\}^N$.
That is precisely the statement of Lemma 13. ■


We used the following lemma in the proof of Lemma 12. For $l = 0, 1, \dots, 2^n - 1$, the $l$-th orthant
of $\mathbb{R}^n$ is the set of all vectors $r \in \mathbb{R}^n$ with strictly positive or negative entries and $\operatorname{int}(\operatorname{hs}(r)) = l$,
where $\operatorname{hs}$ denotes the entry-wise Heaviside function, assigning value 0 to negative entries and value
1 to positive entries.

Lemma 13 Let $N = 2^n - 1$. There is an affine map $\{0,1\}^N \to \mathbb{R}^n$; $z \mapsto Wz + b$, sending
$\{z \in \{0,1\}^N : l(z) = l\}$ to the $l$-th orthant of $\mathbb{R}^n$, for all $l \in \{0, 1, \dots, N\}$.

Proof Consider the affine map $z \mapsto Wz + b$, where $b = -(1, \dots, 1)^\top$ and the $l$-th column of $W$
is $2^{l+1}(\operatorname{bin}_n(l) - \frac{1}{2})$ for all $l \in \{1, \dots, N\}$. For this choice, $\operatorname{sgn}(Wz + b) = \operatorname{sgn}(\operatorname{bin}_n(l(z)) - \frac{1}{2})$,
so that $Wz + b$ lies in the $l(z)$-th orthant of $\mathbb{R}^n$. ■


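Lemma 13 is illustrated in Fig. 3 for $n = 2$. As another example, for $n = 3$ the affine map can
be defined as $z \mapsto Wz + b$, where

$$b = \begin{pmatrix} -1 \\ -1 \\ -1 \end{pmatrix}
\quad \text{and} \quad
W = \begin{pmatrix}
2 & -4 & 8 & -16 & 32 & -64 & 128 \\
-2 & 4 & 8 & -16 & -32 & 64 & 128 \\
-2 & -4 & -8 & 16 & 32 & 64 & 128
\end{pmatrix}.$$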
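The affine map from the proof of Lemma 13 can be verified by enumerating all vertices of the cube for small $n$. The sketch below builds $W$ and $b$ from the stated column rule and checks that every $z$ lands in the $l(z)$-th orthant; the function names are choices made for this example, and vectors are read with the first entry as the least significant bit.

```python
import itertools


def int_repr(v):
    """int(v) = sum_i 2^(i-1) v_i, with v[0] playing the role of v_1."""
    return sum(2 ** i * vi for i, vi in enumerate(v))


def bin_repr(l, n):
    """n-bit vector with int_repr(bin_repr(l, n)) == l."""
    return tuple((l >> i) & 1 for i in range(n))


def largest_one(z):
    """l(z): largest 1-based index j with z_j = 1, and 0 for the zero vector."""
    return max((j + 1 for j, zj in enumerate(z) if zj == 1), default=0)


def lemma13_map(n):
    """W (n x N) and b (length n); column l of W is 2^(l+1) (bin_n(l) - 1/2)."""
    N = 2 ** n - 1
    W = [[2 ** l * (2 * bin_repr(l, n)[row] - 1) for l in range(1, N + 1)]
         for row in range(n)]
    b = [-1] * n
    return W, b


def orthant_index(r):
    """Index int(hs(r)) of the orthant containing r (all entries nonzero)."""
    assert all(ri != 0 for ri in r)
    return int_repr(tuple(1 if ri > 0 else 0 for ri in r))


def check_lemma13(n):
    W, b = lemma13_map(n)
    N = 2 ** n - 1
    for z in itertools.product((0, 1), repeat=N):
        r = [sum(W[row][j] * z[j] for j in range(N)) + b[row] for row in range(n)]
        if orthant_index(r) != largest_one(z):
            return False
    return True


W3, b3 = lemma13_map(3)
for row in W3:
    print(row)
print(check_lemma13(2), check_lemma13(3))
```

For $n = 3$ the generated matrix coincides with the one displayed above. The check succeeds because the column of the largest active entry, with entries $\pm 2^{l}$, dominates the sum of all lower columns plus the bias.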


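Proposition 14 Let $N = 2^n - 1$ and let $Q$ be defined as in Lemma 12. Then $Q \circ \mathcal{E}_N = \Delta_n^+$.

Proof Consider a strictly positive product distribution $p \in \mathcal{E}_N$ with $p(z) = \prod_{i=1}^{N} p_i^{1 - z_i} (1 - p_i)^{z_i}$ for
all $z \in \{0,1\}^N$. Then $p^\top Q \in \Delta_n$ is the vector $q = (q_0, q_1, \dots, q_N)$ with entries

$$q_i = \sum_{z : \, l(z) = i} p(z)
= \Big( \sum_{z_k, \, k < i} \; \prod_{k < i} p_k^{1 - z_k} (1 - p_k)^{z_k} \Big) (1 - p_i) \prod_{j > i} p_j
= (1 - p_i) \prod_{j > i} p_j$$

for all $i = 1, \dots, N$, and $q_0 = \prod_{j > 0} p_j$. Therefore,

$$\frac{q_i}{q_0} = \frac{1 - p_i}{p_i} \, \frac{1}{\prod_{j < i} p_j} \quad \forall i = 1, \dots, N. \tag{3}$$

Since $\frac{1 - p_i}{p_i}$ can be made arbitrary in $(0, \infty)$ by choosing an appropriate $p_i$, independently of $p_j$, for
$j < i$, the quotient $\frac{q_i}{q_0}$ can be made arbitrary in $(0, \infty)$ for all $i \in \{1, \dots, N\}$. This implies that $q$
can be made arbitrary in $\Delta_n^+$. ■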
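The surjectivity argument can be made concrete. The sketch below pushes a product distribution through the deterministic kernel $Q$ of Lemma 12 by brute-force enumeration and inverts the map for a given strictly positive target $q$. The inversion formula $p_i = \sum_{j<i} q_j / \sum_{j \leq i} q_j$ is not stated in the paper; it is derived here from $q_i = (1 - p_i) \prod_{j>i} p_j$, which gives $\prod_{j>i} p_j = \sum_{j \leq i} q_j$. Function names and the example target are assumptions.

```python
import itertools


def largest_one(z):
    """l(z): largest 1-based index j with z_j = 1, and 0 for the zero vector."""
    return max((j + 1 for j, zj in enumerate(z) if zj == 1), default=0)


def product_prob(p, z):
    """p(z) = prod_j p_j^(1 - z_j) (1 - p_j)^(z_j); p[j-1] is the probability of z_j = 0."""
    prob = 1.0
    for pj, zj in zip(p, z):
        prob *= (1.0 - pj) if zj == 1 else pj
    return prob


def pushforward(p):
    """q = p^T Q for the kernel Q of Lemma 12: q_i collects the mass of {z : l(z) = i}."""
    N = len(p)
    q = [0.0] * (N + 1)
    for z in itertools.product((0, 1), repeat=N):
        q[largest_one(z)] += product_prob(p, z)
    return q


def solve_for_p(q):
    """Product distribution whose image is q, for strictly positive q summing to 1."""
    p = []
    below = q[0]  # running value of sum_{j < i} q_j
    for i in range(1, len(q)):
        p.append(below / (below + q[i]))
        below += q[i]
    return p


# Illustrative target in the interior of the simplex for n = 2 (N = 3).
target = [0.1, 0.2, 0.3, 0.4]
p = solve_for_p(target)
q = pushforward(p)
max_error = max(abs(a - b) for a, b in zip(q, target))
print(p, q)
```

Every component $p_i$ returned by `solve_for_p` lies strictly between 0 and 1 whenever the target is strictly positive, in line with the statement of the proposition.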


Proof of Theorem 6 This follows from Proposition 11 and Proposition 14. 


5. The Number of Hidden Units for Changeable Weights in the Second Layer 

In order to prove Theorem 8 we will use the same construction of the first layer as in the previous 
section. For the second layer we will use the following refinement of Lemma 12. 

Lemma 15 Let $n \geq 2$ and $N = 2^{n-1} - 1$. The set $F_{N,n}$ can approximate the following kernels
arbitrarily well:

$$Q(\cdot|z) = \lambda_z \, \delta_{\operatorname{bin}_n 2l(z)}(\cdot) + (1 - \lambda_z) \, \delta_{\operatorname{bin}_n 2l(z) + 1}(\cdot) \quad \forall z \in \{0,1\}^N,$$

where $\lambda_z$ are certain (not mutually independent) weights in $[0,1]$. Given any $r_l \in \mathbb{R}_+$ for all
$l \in \{0, 1, \dots, N\}$, it is possible to choose the $\lambda_z$'s such that

$$\frac{\sum_{z : \, l(z) = l} \lambda_z}{\sum_{z : \, l(z) = l} (1 - \lambda_z)} = r_l \quad \forall l \in \{0, 1, \dots, N\}.$$

In words, the $z$-th row of $Q$ is a convex combination of the indicators of $2l(z)$ and $2l(z) + 1$,
and, furthermore, the total weight assigned to $2l$ relative to the total weight assigned to $2l + 1$ can
be made arbitrary for each $l$. For example, for $n = 2$ we have $N = 1$ and

$$Q = \begin{array}{c|cccc}
 & 00 & 01 & 10 & 11 \\ \hline
0 & \lambda_0 & (1 - \lambda_0) & & \\
1 & & & \lambda_1 & (1 - \lambda_1)
\end{array}$$

The sum of all weights in a given even column can be made arbitrary, relative to the sum of all
weights in the column right next to it, for all $N + 1$ such pairs of columns simultaneously.

Proof of Lemma 15 Consider the sets $Z_l = \{z \in \{0,1\}^N : l(z) = l\}$, for $l = 0, 1, \dots, N$. Let
$W' \in \mathbb{R}^{(n-1) \times N}$ and $b' \in \mathbb{R}^{n-1}$ be the input weights and biases defined in Lemma 12. We define
$W$ and $b$ by appending a row $(\mu_1, \dots, \mu_N)$ on top of $W'$ and an entry $\mu_0$ on top of $b'$.

If $\mu_j < 0$ for all $j = 0, 1, \dots, N$, then $z \mapsto Wz + b$ maps $Z_l$ to the $2l$-th orthant of $\mathbb{R}^n$, for each
$l = 0, 1, \dots, N$.

Consider now some arbitrary fixed choice of $\mu_j$, $j < l$. Choosing $\mu_l < 0$ with $|\mu_l| > \sum_{j < l} |\mu_j|$,
$Z_l$ is mapped to the $2l$-th orthant. If $\mu_l \to -\infty$, then $\lambda_z \to 1$ for all $z$ with $l(z) = l$. As we increase
$\mu_l$ to a sufficiently large positive value, the elements of $Z_l$ gradually are mapped to the $(2l + 1)$-th
orthant. If $\mu_l \to \infty$, then $(1 - \lambda_z) \to 1$ for all $z$ with $l(z) = l$. By continuity, there is a choice of
$\mu_l$ such that $\frac{\sum_{z : \, l(z) = l} \lambda_z}{\sum_{z : \, l(z) = l} (1 - \lambda_z)} = r_l$.

Note that the images of $Z_j$, $j < l$, are independent of the $i$-th columns of $W$ for all $i = l, \dots, N$,
because every $z \in Z_j$ has $z_i = 0$ for $i > j$. Hence changing $\mu_l$ does not have any influence on the
images of $Z_j$, $j < l$, nor on $\lambda_z$ for $z$ with $l(z) < l$. Tuning $\mu_i$ sequentially, starting with $i = 0$,
we obtain a kernel that approximates any $Q$ of the claimed form arbitrarily well. ■
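The sequential tuning can be sketched numerically under one reading of the construction, which is an assumption of this example: the first output unit stays stochastic with pre-activation $\mu_0 + \sum_j \mu_j z_j$, so that $\lambda_z = 1 - \sigma(\mu_0 + \sum_j \mu_j z_j)$, while the remaining output units are driven to the deterministic limit. Each $\mu_l$ is then found by bisection, using that the ratio for $Z_l$ is strictly decreasing in $\mu_l$ and does not depend on $\mu_j$ for $j > l$. The search interval and the target ratios are illustrative.

```python
import itertools
import math


def sigmoid(a):
    """Logistic sigmoid, written to avoid overflow for large |a|."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)


def largest_one(z):
    """l(z): largest 1-based index j with z_j = 1, and 0 for the zero vector."""
    return max((j + 1 for j, zj in enumerate(z) if zj == 1), default=0)


def ratio(mu, l, N):
    """sum of lambda_z over Z_l divided by sum of (1 - lambda_z) over Z_l,
    with lambda_z = 1 - sigmoid(mu_0 + sum_j mu_j z_j) (assumed reading)."""
    num = den = 0.0
    for z in itertools.product((0, 1), repeat=N):
        if largest_one(z) != l:
            continue
        a = mu[0] + sum(mu[j + 1] * z[j] for j in range(N))
        num += sigmoid(-a)  # lambda_z
        den += sigmoid(a)   # 1 - lambda_z
    return num / den


def solve_mu(r, N, bound=40.0, steps=200):
    """Tune mu_0, ..., mu_N one after another so that ratio(mu, l, N) = r[l]."""
    mu = [0.0] * (N + 1)
    for l in range(N + 1):
        lo, hi = -bound, bound
        for _ in range(steps):
            mu[l] = (lo + hi) / 2.0
            if ratio(mu, l, N) > r[l]:
                lo = mu[l]  # ratio too large: increase mu_l
            else:
                hi = mu[l]
        mu[l] = (lo + hi) / 2.0
    return mu


# Illustrative targets for n = 3, that is N = 2^(n-1) - 1 = 3.
N = 3
targets = [2.0, 0.5, 1.0, 3.0]
mu = solve_mu(targets, N)
errors = [abs(ratio(mu, l, N) - targets[l]) for l in range(N + 1)]
print(mu, errors)
```

For $l = 0$ the set $Z_0$ contains only the zero vector and the ratio equals $e^{-\mu_0}$, so the first step returns $\mu_0 = -\log r_0$.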


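Let $\mathcal{Q}$ be the collection of kernels described in Lemma 15.

Proposition 16 Let $n \geq 2$ and $N = 2^{n-1} - 1$. Then $\mathcal{Q} \circ \mathcal{E}_N = \Delta_n^+$.

Proof Consider a strictly positive product distribution $p \in \mathcal{E}_N$ with $p(z) = \prod_{i=1}^{N} p_i^{1 - z_i} (1 - p_i)^{z_i}$
for all $z \in \{0,1\}^N$. Then $p^\top Q \in \Delta_n$ is a vector $(q_0, q_1, \dots, q_{2N+1})$ whose entries satisfy

$$q_{2i} + q_{2i+1} = (1 - p_i) \prod_{j > i} p_j$$

for all $i = 1, \dots, N$ and $q_0 + q_1 = \prod_{j > 0} p_j$. As in the proof of Proposition 14, this implies that the
vector $(q_0 + q_1, q_2 + q_3, \dots, q_{2N} + q_{2N+1})$ can be made arbitrary in $\Delta_{n-1}^+$. This is irrespective of
the coefficients $\lambda_z$. Now all we need to show is that we can make $q_{2i}$ arbitrary relative to
$q_{2i+1}$ for all $i = 0, \dots, N$.

We have

$$q_{2i} = \sum_{z : \, l(z) = i} \lambda_z \, p(z)
= \sum_{z : \, l(z) = i} \lambda_z \Big( \prod_{k < i} p_k^{1 - z_k} (1 - p_k)^{z_k} \Big) \times (1 - p_i) \Big( \prod_{j > i} p_j \Big)$$

and

$$q_{2i+1} = \sum_{z : \, l(z) = i} (1 - \lambda_z) \, p(z)
= \sum_{z : \, l(z) = i} (1 - \lambda_z) \Big( \prod_{k < i} p_k^{1 - z_k} (1 - p_k)^{z_k} \Big) \times (1 - p_i) \Big( \prod_{j > i} p_j \Big).$$

Therefore,

$$\frac{q_{2i}}{q_{2i+1}} = \frac{\sum_{z : \, l(z) = i} \lambda_z \big( \prod_{k < i} p_k^{1 - z_k} (1 - p_k)^{z_k} \big)}{\sum_{z : \, l(z) = i} (1 - \lambda_z) \big( \prod_{k < i} p_k^{1 - z_k} (1 - p_k)^{z_k} \big)}.$$

By Lemma 15 it is possible to choose all $\lambda_z$ arbitrarily close to zero for all $z$ with $l(z) = i$ and have
them transition continuously to values arbitrarily close to one (independently of the values of $\lambda_z$,
$z$: $l(z) < i$). Since all $p_k$ are strictly positive, this implies that the quotient $\frac{q_{2i}}{q_{2i+1}}$ takes all values in
$(0, \infty)$ as the $\lambda_z$, $z$: $l(z) = i$, transition from zero to one. ■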


Proof of Theorem 8 This follows from Proposition 11 and Proposition 16. 


6. Conclusions 

This article proves upper bounds on the minimal size of binary shallow stochastic feedforward 
networks with sigmoid activation probabilities that can approximate any stochastic function with a 
given number of binary inputs and outputs arbitrarily well. By our analysis, if all parameters of the 
network are free, $2^{k-1}(2^{n-1} - 1)$ hidden units suffice, and, if only the parameters of the first layer
are free, $2^{k-1}(2^n - 1)$ hidden units suffice.

It is interesting to compare these results with what is known about universal approximation of 
Markov kernels by shallow undirected stochastic networks, called conditional restricted Boltzmann 
machines. For those networks previous work (Montufar et al., 2014) has shown that $2^{k-1}(2^n - 1)$
hidden units suffice, whereby, if the number $k$ of input units is large enough, $\frac{1}{4} 2^k (2^n - 1 + 1/30)$
suffice. These bounds are sandwiched between our bounds for feedforward networks. In the case
of no input units, our bound $2^{n-1} - 1$ equals the known bounds for universal approximation of
probability distributions by restricted Boltzmann machines (Montufar and Ay, 2011). Hence, given
the current state of knowledge, if we were to specify a smallest possible universal approximator of
Markov kernels or probability distributions, feedforward networks would seem preferable.
However, verifying the tightness of the bounds appears to be a very challenging problem in either case.
It has been observed that undirected networks can represent many kernels that can be represented by 
feedforward networks, especially when these are not too stochastic (Montufar et al., 2014; Montufar, 
2014). In future work it would be interesting to compare the representational power of both network 
architectures in more detail. 

We think that it is possible to adapt our analysis to cover deep architectures as well. This should 
allow us to conclude that a multilayer feedforward stochastic network with k input units and n 
output units is a universal approximator of Markov kernels if it has about $2^{n-1}$ hidden layers, each
containing about $n 2^{k-1}$ units. The verification of this claim is left for future work. In relation with
this, the results presented in this paper should be helpful for analyzing the relative representational
power of shallow vs. deep stochastic feedforward networks, a topic that has attracted much interest
in recent years and that still poses a great many questions.